Modeling Algorithm Performance on Highly-threaded Many-core Architectures
Authors
Abstract
Chapter 1: Introduction
  1.1 Examples of Highly-threaded Many-core Architectures
  1.2 Research Questions
  1.3 Methodology for Performance Modeling
    1.3.1 Find Key Factors of Performance
    1.3.2 Correlate 3 Spaces of Parameters
    1.3.3 Define Performance Metric
  1.4 Contribution and Dissertation Structure
Chapter 2: Background and Related Work
  2.1 GPU Architectures and Programming Model
  2.2 Abstract Machine Models
    2.2.1 Sequential Machine Models
    2.2.2 Parallel Machine Models
    2.2.3 GPU Machine Models
  2.3 Calibrated Performance Models
  2.4 Algorithms for Memory Constrained Applications
Chapter 3: Threaded Many-core Memory (TMM) Model
  3.1 Abstraction of Highly-threaded Many-core Machines
    3.1.1 Architectures
    3.1.2 Parameters
    3.1.3 Applicability
  3.2 TMM Analysis Structure
Chapter 4: Application of the TMM Model
  4.1 All-pairs Shortest Path (APSP)
    4.1.1 Dynamic Programming via Matrix Multiplication
    4.1.2 Johnson’s Algorithm: Dijkstra’s Algorithm (Binary Heaps)
    4.1.3 Johnson’s Algorithm: Dijkstra’s Algorithm (Arrays)
    4.1.4 n Iterations of Bellman-Ford Algorithm
    4.1.5 Comparison of Various Algorithms
    4.1.6 Effect of Problem Size
    4.1.7 Empirical Validation
  4.2 String Matching
    4.2.1 Suffix Tree
    4.2.2 Suffix Array
    4.2.3 Comparison and Empirical Validation
  4.3 Fast Fourier Transform (FFT)
  4.4 Merge Sort
    4.4.1 Blocked Merge
    4.4.2 Merge Sort
  4.5 List Ranking
  4.6 Analysis of Additional Algorithms
Chapter 5: Calibrated Performance Model
  5.1 Performance Modeling
    5.1.1 Base Model
    5.1.2 Model Extension
  5.2 Model Application
    5.2.1 Synthetic Micro-benchmark for Hashing
    5.2.2 Parallel Bloom Filters Algorithm Design and Implement
    5.2.3 Bloom Filters in BLAST
    5.2.4 Model Use to Evaluate Performance Tradeoffs
Similar resources
High-Order Finite-differences on multi-threaded architectures using OCCA
High-order finite-difference methods are commonly used in wave propagators for industrial subsurface imaging algorithms. Computational aspects of the reduced linear elastic vertical transversely isotropic propagator are considered. Thread parallel algorithms suitable for implementing this propagator on multi-core and many-core processing devices are introduced. Portability is addressed through ...
Addressing Processor Over-provisioning on Large-scale Multi-core Platforms
Modern micro-architectures have embraced multi-core processors and thread-level parallelism for performance growth, because of the difficulty of increasing single-core performance without significantly increasing processor power consumption. To meet the ever-growing need for speed, current large-scale computing platforms are Non-Uniform Memory Access (NUMA) architectures equipped with dozens o...
Efficient implementation of sorting on multi-core SIMD CPU architecture
Sorting a list of input numbers is one of the most fundamental problems in the field of computer science in general and high-throughput database applications in particular. Although literature abounds with various flavors of sorting algorithms, different architectures call for customized implementations to achieve faster sorting times. This paper presents an efficient implementation and detaile...
Multi-threaded Sparse Matrix-Matrix Multiplication for Many-Core and GPU Architectures
Sparse matrix-matrix multiplication is a key kernel that has applications in several domains such as scientific computing and graph analysis. Several algorithms have been studied in the past for this foundational kernel. In this paper, we develop parallel algorithms for sparse matrix-matrix multiplication with a focus on performance portability across different high performance computing archite...
Efficient mapping and acceleration of AES on custom multi-core architectures
Multi-core processors can deliver significant performance benefits for multi-threaded software by adding processing power with minimal latency, given the proximity of the processors. Cryptographic applications are inherently complex and involve large computations. Most cryptographic operations can be translated into logical operations, shift operations, and table look-ups. In this paper we desi...